
5.6 Outlier Suppression: Pushing the Limit of Low-Bit Transformer Language Models

Wei et al. [243] propose a method that suppresses the outliers in language models, thereby pushing the accuracy of 6-bit post-training quantization (PTQ) and 4-bit quantization-aware training (QAT) on BERT to the full-precision level.

Previous works [17, 165] indicate that Transformer-based models hold significantly large activation outliers (with magnitudes close to 100). Moreover, these extreme outliers appear in structured patterns: they mainly gather at a few embedding dimensions and become even larger on particular tokens. Because such outliers can devastate quantization performance, the existing method [17] resorts to workaround solutions such as a finer quantization granularity. However, finer quantization granularity increases the computation cost and unavoidably hinders the acceleration effect. In contrast, Wei et al. propose to suppress the outliers rather than work around them. They first provide an in-depth analysis of what induces the outliers and of the impact of clipping them.

5.6.1 Analysis

Specifically, the analysis presents two findings: (1) the scaling parameter in LayerNorm amplifies the outliers along certain embedding dimensions, and (2) when the outliers are clipped and the final performance is evaluated, the importance of individual outliers varies greatly. For the first finding, the scaling parameter γ in LayerNorm acts as an outlier amplifier, magnifying the outliers in the output. For token t at the j-th embedding dimension, LayerNorm is defined as follows:

\tilde{X}_{t,j} = \frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} \cdot \gamma_j + \beta_j, \qquad (5.13)

where $\mu_t$ and $\sigma_t^2$ are the mean and variance of token t, respectively. From this formula, the multiplier γ plays a crucial part in amplifying the magnitude of token t, as shown in Fig. 5.8. Thus, they propose to remove the amplification effect by extracting γ from Eq. (5.13) and using the Non-scaling LayerNorm of Eq. (5.14):

X'_{t,j} = \frac{X_{t,j} - \mu_t}{\sqrt{\sigma_t^2 + \epsilon}} + \frac{\beta_j}{\gamma_j}, \qquad (5.14)

Since extracting γ shortens the magnitude of token t, the resulting $X'$ is more quantization-friendly than $\tilde{X}$.
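
To make the effect concrete, the following is a minimal PyTorch sketch, not the authors' code: the tensor shapes, the toy γ with one large entry, and the function names are illustrative assumptions. It compares the standard LayerNorm of Eq. (5.13) with the Non-scaling LayerNorm of Eq. (5.14):

import torch

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm, Eq. (5.13): normalize each token, then scale by gamma.
    mu = x.mean(dim=-1, keepdim=True)                    # per-token mean
    var = x.var(dim=-1, unbiased=False, keepdim=True)    # per-token variance
    return (x - mu) / torch.sqrt(var + eps) * gamma + beta

def nonscaling_layernorm(x, gamma, beta, eps=1e-5):
    # Non-scaling LayerNorm, Eq. (5.14): gamma is extracted so it no longer
    # amplifies outlier embedding dimensions; beta is rescaled by gamma.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps) + beta / gamma

# Toy comparison: a gamma with one large entry mimics the outlier amplifier.
x = torch.randn(4, 8)                    # 4 tokens, 8 embedding dimensions
gamma = torch.ones(8); gamma[3] = 30.0   # hypothetical outlier dimension
beta = torch.zeros(8)
print(layernorm(x, gamma, beta).abs().max())             # large dynamic range
print(nonscaling_layernorm(x, gamma, beta).abs().max())  # much smaller range

In the full method, the extracted γ is migrated into the subsequent modules so that the network's function is preserved; the sketch above only illustrates why the non-scaling output has a smaller dynamic range and is therefore easier to quantize.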

For the second finding, they discover that the impact on final performance of clipping the outliers varies greatly from outlier to outlier. Take the outliers after GELU as an example. Fig. 5.9 shows that sharply clipping the more aggressive outliers (clipping signals in the 10-100 range down to 10) does not even hurt the full-precision performance, with accuracy remaining at 91.02. At the same time, the accuracy drops suddenly to 85.93 when too many outliers are cut. In addition, although the less important outliers present in a long-tail form, they are provided by only a few tokens; in particular, the unimportant outliers that can be clipped without any accuracy drop in FP models correspond to only a few tokens. The red points in Fig. 5.9, which represent the proportion of clipped tokens, clearly show that although the more aggressive outliers occupy a large range from 10 to 100, they match only about 3% of tokens. Destroying those sharper outliers belonging to a few tokens does not affect the performance.
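
As a rough illustration of this token-level statistic, the following sketch clips post-GELU activations at a threshold and measures the fraction of tokens that contain any clipped value. The helper name clip_and_token_ratio, the threshold of 10, and the toy activation tensor are hypothetical assumptions, not the paper's actual setup:

import torch

def clip_and_token_ratio(acts, threshold):
    # acts: (num_tokens, hidden_dim) post-GELU activations.
    clipped = acts.clamp(max=threshold)          # clip aggressive outliers
    affected = (acts > threshold).any(dim=-1)    # tokens holding clipped signals
    return clipped, affected.float().mean().item()

# Toy data: mostly small activations, a few tokens with outliers in 10-100.
acts = torch.rand(1000, 768) * 2.0
acts[:30, 5] = torch.rand(30) * 90 + 10          # ~3% of tokens carry outliers
_, ratio = clip_and_token_ratio(acts, threshold=10.0)
print(f"fraction of tokens with clipped outliers: {ratio:.3f}")  # ~0.030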